Abstract

Social science literature has shown a strong connection 
between the visual appearance of a city's neighborhoods 
and the behavior and health of its citizens. Yet, this re- 
search is limited by the lack of methods that can be used 
to quantify the appearance of streetscapes across cities or 
at high enough spatial resolutions. In this paper, we de- 
scribe 'Streetscore', a scene understanding algorithm that 
predicts the perceived safety of a streetscape, using training 
data from an online survey with contributions from more 
than 7000 participants. We first study the predictive power 
of commonly used image features using support vector regression,
finding that Geometric Texton and Color Histograms, along with
GIST, are the best performers at predicting the perceived safety
of a streetscape. Using Streetscore, we create high resolution
maps of perceived
safety for 21 cities in the Northeast and Midwest of the 
United States at a resolution of 200 images/square mile, 
scoring ~1 million images from Google Streetview. These
datasets should be useful for urban planners, economists 
and social scientists looking to explain the social and eco- 
nomic consequences of urban perception. 



1. Introduction 

How does the appearance of a neighborhood impact the
health and behavior of the individuals who inhabit it?
During the last decades numerous research efforts have ex- 
plored this question. This research has shown an associa- 
tion between neighborhood disorder and criminal behavior 
through the well-known 'broken windows theory' [ , ], 
but also an association between neighborhood disorder and 
health outcomes, such as the spread of STDs [z], the inci- 
dence of obesity [ ], and rates of female alcoholism [' ]. 

Until recently, most data on the physical appearance of 
urban environments was based on low throughput surveys 
[15, 1^]. More recently, however, online data collection
methods where humans evaluate images, using experts [ ]
or crowdsourcing [ ], have increased the availability of urban
perception data, but not to the extent needed to create
global maps. In fact, even the most ambitious crowdsourc- 
ing efforts [ ] have a limited throughput, being able only 
to rank images from a handful of cities at a resolution of less 
than 10 images per square mile. These constraints limit the 
possibility of using survey based methods to create global 
maps of urban perception. 

The good news about crowdsourced studies is that they 
provide an ideal training dataset for machine learning meth- 
ods building on scene understanding literature in computer 
vision. A trained algorithm, in turn, can be used to create 
high resolution maps of urban perception at required spatial 
and geographical scales. 

Here, we demonstrate that a predictor trained using 
generic image features and the scores of perceived safety 
from a crowdsourced study can accurately predict the safety 
scores of streetscapes not used in the training dataset. We 
evaluate the predictive power of different image features 
commonly used for scene understanding and show that this 
predictor can be used to extend existing datasets to images 
and cities for which no evaluative data is available. 

2. Image Ranking using Trueskill 

We use publicly available data from the crowdsourced 
study by Salesses et al. [22] to train our algorithm. In
this study, participants in an online game were shown two
Google Streetview images randomly chosen from the cities
of New York, Boston, Linz and Salzburg (Fig. 1-(a)).
Participants were asked to choose one of the two images in
response to the question: 'Which place looks safer?'. 7,872
unique participants from 91 countries ranked 4,109 images
using 208,738 pairwise comparisons (or 'clicks').

Figure 1. We convert the pairwise image comparisons obtained
from a crowdsourced study (a) by Salesses et al. [ ] to a ranked
score using Trueskill [ ]. (b) Trueskill converges to a stable score
after ~16 clicks in our case. (c) The images are ranked on their
perceived safety ($q_s$) between 0 and 10.

We convert these preferences to a ranked score for each
image using the Microsoft Trueskill algorithm [ ]. Trueskill
uses a Bayesian graphical model to rate players competing
in online games. In our case, we consider each image to
be the winner of a two-player contest when it is selected in
response to a question over another image. Each player's
skill is modeled as a $\mathcal{N}(\mu, \sigma^2)$ random variable, which gets
updated after every contest. The update equations for a
two-player contest between players x and y, in which x wins
against y, are as follows -

where $\mathcal{N}(\mu_x, \sigma_x^2)$ and $\mathcal{N}(\mu_y, \sigma_y^2)$ are the Trueskills of x and
y. The pre-defined constant $\beta$ represents a per-game variance,
and $\epsilon$ is the empirically estimated probability that
two players will tie. Functions $f(\theta) = \mathcal{N}(\theta)/\Phi(\theta)$ and
$g(\theta) = f(\theta) \cdot (f(\theta) + \theta)$ are defined using the Normal
probability density function $\mathcal{N}(\theta)$ and the Normal cumulative
distribution function $\Phi(\theta)$. Following [ ], we use ($\mu = 25$, $\sigma = 25/3$)
as initial values for rankings for all images and choose
$\beta = 25/3$ and $\epsilon = 0.1333$.
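
The update for a single click can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes the standard two-player TrueSkill win update, with a helper quantity c defined by c^2 = 2*beta^2 + sigma_x^2 + sigma_y^2 and the functions f and g evaluated at theta = (mu_x - mu_y - eps) / c. The function names and the way eps enters theta are assumptions made for illustration.

```python
import math


def normal_pdf(theta: float) -> float:
    """Standard Normal probability density N(theta)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def normal_cdf(theta: float) -> float:
    """Standard Normal cumulative distribution Phi(theta)."""
    return 0.5 * (1.0 + math.erf(theta / math.sqrt(2.0)))


def f(theta: float) -> float:
    # f(theta) = N(theta) / Phi(theta), as defined in the text.
    return normal_pdf(theta) / normal_cdf(theta)


def g(theta: float) -> float:
    # g(theta) = f(theta) * (f(theta) + theta), as defined in the text.
    return f(theta) * (f(theta) + theta)


def update_after_win(mu_x, sigma_x, mu_y, sigma_y, beta=25.0 / 3.0, eps=0.1333):
    """Return updated (mu_x, sigma_x, mu_y, sigma_y) after x beats y.

    Assumption: standard TrueSkill two-player update, with
    c^2 = 2*beta^2 + sigma_x^2 + sigma_y^2 and theta = (mu_x - mu_y - eps) / c.
    """
    c = math.sqrt(2.0 * beta ** 2 + sigma_x ** 2 + sigma_y ** 2)
    theta = (mu_x - mu_y - eps) / c
    # The winner's mean moves up and the loser's mean moves down.
    new_mu_x = mu_x + (sigma_x ** 2 / c) * f(theta)
    new_mu_y = mu_y - (sigma_y ** 2 / c) * f(theta)
    # Both variances shrink, since every contest adds information.
    new_var_x = sigma_x ** 2 * (1.0 - (sigma_x ** 2 / c ** 2) * g(theta))
    new_var_y = sigma_y ** 2 * (1.0 - (sigma_y ** 2 / c ** 2) * g(theta))
    return new_mu_x, math.sqrt(new_var_x), new_mu_y, math.sqrt(new_var_y)


# Two images start at the initial rating (mu = 25, sigma = 25/3); image x is clicked.
mu_x, sigma_x, mu_y, sigma_y = update_after_win(25.0, 25.0 / 3.0, 25.0, 25.0 / 3.0)
```

Applying `update_after_win` once per click, in the order the clicks were recorded, yields a running estimate of each image's mean and uncertainty. For extremely lopsided contests `normal_cdf` can underflow to zero, so a production version would need a numerically safer ratio.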

In a two-player contest, Trueskill converges to a stable
estimate of $\mu$ after 12 to 36 contests [ ]. Figure 1-(b) shows
that in our case, Trueskill estimates for $\mu$ are fairly stable
after 16 clicks per image (we have, on average, 35.92 clicks
per image). Finally, we scale the scores to a range between
0 and 10 and denote these perceptual image scores, or
'Q-scores', mathematically as $q_s$. Going forward, we focus on
the US and only use the 2920 images from New York and
Boston.
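
The rescaling step can be sketched as below. The text only states that scores are scaled to a range between 0 and 10; a linear min-max mapping is assumed here, and the function name and the example values are hypothetical.

```python
def scale_to_range(values, low=0.0, high=10.0):
    """Linearly map values so that the minimum goes to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot scale a constant list of scores")
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


# Made-up Trueskill means for four images (illustrative only, not from the study).
example_mu = [18.0, 25.0, 30.0, 38.0]
q_scores = scale_to_range(example_mu)
```

A linear map preserves the ranking and the relative spacing of the Trueskill means, which is why it is a natural choice when the scores are later used as regression targets.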



3. Training a Computational Model 

One would be inclined to believe that people's perception
of safety is highly subjective. However, Salesses et al. [ ]
demonstrate that the results obtained from their study
are not driven by biases in age, gender or location of the
participants, but by differences in the visual attributes of
images. Hence, creating a computational model for perceived
safety based only on image features is feasible.

We draw upon previous work on image aesthetics whose 
goal is to predict the perceived beauty of photographs. Early 
research [4, 16] in this area was based on features extracted 
using rules from photography theory and psychology. Re- 
cently, however, task-independent generic image features 
have been found to be more effective for this purpose [18]. 
Similarly, we choose generic image features (e.g. [27, 14]) 
over rule-based features because it is not possible to create 
an exhaustive list of features for a set of images that is open 
and unknown. 

Next, we describe our feature extraction process. 

3.1. Image Feature Extraction 

Figure 1-(c) shows eight images from our training
dataset sorted from low to high scores. The typical high
scoring image contains suburban houses with manicured
lawns and streets lined with trees, while the typical low
scoring image contains empty streets, fences, and industrial
buildings. To predict their perceived safety, the image
features need to capture this variance in appearance. Following
Xiao et al. [ ], we extract multiple generic image features
which were shown to perform well for semantic scene
classification, as summarized in Table 1. Specifically, we extract
GIST, Geometric Classification Map, Texton Histograms,
Geometric Texton Histograms, Color Histograms, Geometric
Color Histograms, HOG2x2, Dense SIFT, LBP, Sparse
SIFT histograms, and SSIM.

Once we compute image features, we train a predictor for
the perceived safety of images ($q_s$) using Support Vector
Regression, as explained below.

Table 1. Computation procedure for image features used to predict perceived safety.

Feature | Computation Procedure
--- | ---
GIST [ ] | Filterbank outputs (8 orientations at 5 different scales) are averaged on a 5 x 5 grid.
HOG2x2 [ ] | 124-dimensional 2x2 HOG descriptors are quantized into 300 visual words. Three-level spatial pyramid histograms are constructed and compared using histogram intersection.
Dense SIFT [13] | SIFT descriptors are extracted in a flat window at two scales on a regular grid at steps of 5 pixels. The descriptors, stacked together for each HSV color channel, are quantized into 300 visual words. Spatial pyramid histograms are used for kernels.
LBP [ ] | Histograms of local binary patterns.
Sparse SIFT [ ] | SIFT features at Hessian-affine interest points are clustered into dictionaries of 1,000 visual words. Two histograms of soft-assigned SIFT features are computed.
SSIM [24] | Correlation map of a 5 x 5 patch in a 40 pixel window is quantized into 3 radial and 10 angular bins. Spatial histograms are constructed from 300 visual words of descriptors.
Texton Histograms | A 512-dimension histogram is built from a texton map [ ] of each image obtained from a universal texton dictionary.
Color Histograms | A joint 4 x 4 x 14 histogram of color in CIELab color space for each image.
Geometric Classification Map | Histograms of geometric class probabilities [ ] for ground, vertical, porous and sky.
Geometric Texton Histograms | Texton Histograms for each geometric class weighted by class probabilities.
Geometric Color Histograms | Color histograms for each geometric class weighted by class probabilities.

3.2. Prediction using Support Vector Regression

To predict an image's perceived safety, we choose
$\nu$-Support Vector Regression ($\nu$-SVR) [ ]. Given input feature
vectors x and their corresponding labels y, the goal of
Support Vector Regression (SVR) with a linear kernel is to
obtain a regression function f(x) that approximates y, such
that.

SVR tries to control both the training error and the model
complexity, by minimizing the following function -

where the value of $\epsilon$, chosen a priori, determines the
desired accuracy of approximation.
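
For reference, the textbook formulation of linear $\epsilon$-SVR, which matches the symbols used in this section, can be written as below. This is a sketch of the standard form rather than a transcription of the authors' equations; the symbol $\ell$ for the number of training images, the dimension $N$ and the notation $|\cdot|_{\epsilon}$ for the $\epsilon$-insensitive loss are notational assumptions.

```latex
% Linear regression function
f(x) = (w \cdot x) + b, \qquad w, x \in \mathbb{R}^{N}, \; b \in \mathbb{R}

% Regularized risk: model complexity plus epsilon-insensitive training error
\min_{w,\, b} \quad \frac{1}{2}\|w\|^{2} + C \cdot \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_{\epsilon},
\qquad |y - f(x)|_{\epsilon} = \max\{0,\; |y - f(x)| - \epsilon\}
```

The first term penalizes model complexity and the second penalizes training errors larger than $\epsilon$, with the constant C setting the trade-off between the two.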

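The key idea of $\nu$-SVR [ ] is that by choosing $\nu$ instead
of committing to a specific value of $\epsilon$, we guarantee
that the fraction of predictions with an error more than $\epsilon$ is
smaller than $\nu$. This is achieved by solving the following
constrained minimization problem -

$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^{2} + C \Big( \nu\epsilon + \frac{1}{\ell} \sum_{i=1}^{\ell} (\xi_i + \xi_i^{*}) \Big) \\
\text{subject to} \quad & ((w \cdot x_i) + b) - y_i \le \epsilon + \xi_i, \\
& y_i - ((w \cdot x_i) + b) \le \epsilon + \xi_i^{*}, \\
& \epsilon \ge 0, \quad \xi_i, \xi_i^{*} \ge 0,
\end{aligned}
$$

where $\ell$ is the number of training images and $\xi_i$, $\xi_i^{*}$ are slack variables.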



3.3. Performance Evaluation 

We now evaluate the performance of these features by
training an SVR for each feature. We use the Coefficient
of Determination ($R^2$) between true scores $q_s$ and predicted
scores $\hat{q}_s$ to evaluate the accuracy of a regression model.
$R^2$ is a quantitative measure for the proportion of total variance
of true data explained by the prediction model. It is defined
as

$$R^2 = 1 - \frac{\sum_i (q_s^{(i)} - \hat{q}_s^{(i)})^2}{\sum_i (q_s^{(i)} - \bar{q}_s)^2},$$

where $\bar{q}_s$ is the mean of the true scores.

We train the SVR using libsvm [ ] and choose the linear
kernel with the following parameters: C = 0.01 and $\nu$ = 0.5.
We determine the optimal C and $\nu$ using a grid search to
maximize the $R^2$ of prediction over 5-fold cross-validation.
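
The coefficient of determination used as the evaluation metric can be computed with a few lines of Python. This is a minimal sketch using the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares); the function name and the example values are illustrative and not taken from the study.

```python
def r_squared(true_scores, predicted_scores):
    """Coefficient of determination between true and predicted scores."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(true_scores) / len(true_scores)
    # Residual sum of squares: variance left unexplained by the predictor.
    ss_res = sum((t - p) ** 2 for t, p in zip(true_scores, predicted_scores))
    # Total sum of squares: variance of the true scores around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in true_scores)
    if ss_tot == 0:
        raise ValueError("true scores are constant, R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up scores: a perfect predictor, a predictor that always outputs
# the mean, and a predictor that is partly right.
perfect = r_squared([2.0, 4.0, 6.0], [2.0, 4.0, 6.0])
mean_only = r_squared([2.0, 4.0, 6.0], [4.0, 4.0, 4.0])
partial = r_squared([2.0, 4.0, 6.0], [3.0, 4.0, 5.0])
```

A predictor that always returns the mean score gets a value of zero, so the reported values can be read as the improvement over that trivial baseline.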

The performance of individual features for Q-score
prediction is summarized in Fig. 2-(a). Geometric Texton
Histograms perform the best ($R^2$ = 0.4826), followed by
GIST ($R^2$ = 0.4339). It is important to note that HOG2x2
and Dense SIFT, which are the top performing features for
generic scene classification [ ], do not perform as well,
with $R^2$ = 0.3841 and $R^2$ = 0.4283 respectively. A
combination of all features gives $R^2$ = 0.5676.

Figure 2. We analyze the performance of commonly used image features (a) for predicting perceived safety ($q_s$). Choosing the best
performing features using forward selection, we train the 'Streetscore' predictor and analyze its performance for both regression (b) and
binary classification (c). Panels: (a) Regression Performance of all features; (b) Regression Performance of 'Streetscore';
(c) Classification Performance of 'Streetscore'.

Feature Selection: Our goal is to develop a trained
predictor that can generate a very high resolution dataset of
people's perception of urban environments. To reduce the
computational cost of feature extraction from new images,
we sought to choose the best three features using feature
selection. We use 'forward selection' for this purpose. In
this method, starting with an empty set, we iteratively add
the one feature that increases $R^2$ the most, over
5-fold cross-validation. This process gives us Geometric
Texton Histograms, GIST and Geometric Color Histograms,
in that order, as our three best features, with a combined
$R^2$ = 0.5365 (Fig. 2-(b)). It is interesting to note that
Geometric Color Histograms provide the most performance
improvement after Geometric Texton Histograms and GIST,
even though the individual performance of Geometric Color
Histograms is not impressive ($R^2$ = 0.3103). This shows
that colors are an important dimension in predicting
perceived safety as they provide information that is different
from the one contained in textons and GIST.
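
The forward selection procedure described above can be sketched as a greedy loop. In this sketch `score_fn` is a hypothetical stand-in for the cross-validated $R^2$ of a predictor trained on a given feature subset, and `toy_score` with its feature names and numbers is entirely made up to keep the example self-contained.

```python
def forward_selection(candidates, score_fn, k):
    """Greedily pick up to k features, adding the best-scoring one each round."""
    selected = []
    history = []  # score of the selected set after each addition
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each remaining feature together with what is already selected.
        best = max(remaining, key=lambda feat: score_fn(selected + [feat]))
        selected.append(best)
        remaining.remove(best)
        history.append(score_fn(selected))
    return selected, history


def toy_score(subset):
    # Hypothetical score: each feature contributes a gain, but "a" and "b"
    # carry largely redundant information, so using both adds little.
    gains = {"a": 0.5, "b": 0.45, "c": 0.3, "d": 0.1}
    score = sum(gains[name] for name in subset)
    if "a" in subset and "b" in subset:
        score -= 0.4
    return score


selected, history = forward_selection(["a", "b", "c", "d"], toy_score, k=3)
```

In the toy example the second-best individual feature "b" is never picked because it overlaps with "a", which mirrors the observation that a feature with modest individual performance can still add the most when it carries complementary information.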

We refer to a predictor trained using Geometric Texton
Histograms, GIST and Geometric Color Histograms as the
'Streetscore' predictor and analyze its performance for
binary classification as an additional performance evaluation.

Binary Classification: For binary classification we use
'low' ($t_l$) and 'high' ($t_h$) thresholds to label images in our
test set as 'unsafe' or 'safe' according to their Q-scores. We
study classification accuracy as a function of $t_s = t_h - t_l$,
where $t_l = \bar{q}_s - t_s/2$ and $t_h = \bar{q}_s + t_s/2$, and $\bar{q}_s$ is the
average $q_s$ of the test set (Fig. 2-(c)). For $t_s = 0$ (i.e. $t_l = t_h = \bar{q}_s$)
the accuracy of our classifier is 78.42%, whereas
for $t_s = 2 \cdot \sigma_{q_s}$ the accuracy is as high as 93.49%, over
5-fold cross-validation. We note that the accuracy of a
random predictor would be 50%.
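
The thresholding scheme can be sketched as follows. The text does not spell out how a predicted class is derived from a predicted score, so this sketch assumes that an image is predicted 'safe' when its predicted score is at or above the mean true score of the test set, and that images falling between the two thresholds are left out of the evaluation. The function name and the example scores are hypothetical.

```python
def thresholded_accuracy(true_scores, predicted_scores, t_s):
    """Accuracy of safe/unsafe labels for a threshold gap t_s = t_high - t_low."""
    mean_q = sum(true_scores) / len(true_scores)
    t_low = mean_q - t_s / 2.0
    t_high = mean_q + t_s / 2.0
    kept = 0
    correct = 0
    for q_true, q_pred in zip(true_scores, predicted_scores):
        if q_true < t_low:
            label = 0  # 'unsafe'
        elif q_true >= t_high:
            label = 1  # 'safe'
        else:
            continue  # ambiguous image between the thresholds (assumed excluded)
        kept += 1
        # Assumption: the predicted class compares the predicted score to the mean.
        predicted_label = 1 if q_pred >= mean_q else 0
        correct += int(predicted_label == label)
    if kept == 0:
        raise ValueError("no images fall outside the thresholds")
    return correct / kept


# Made-up true and predicted Q-scores for four test images.
true_q = [1.0, 2.0, 8.0, 9.0]
pred_q = [2.0, 6.0, 7.0, 8.0]
acc_no_gap = thresholded_accuracy(true_q, pred_q, t_s=0.0)
acc_wide_gap = thresholded_accuracy(true_q, pred_q, t_s=7.0)
```

Widening the gap removes images whose true score is close to the mean, which are the hardest to classify, so accuracy generally rises with the gap.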

The robust performance of Streetscore in both classifica- 
tion and regression on this challenging dataset demonstrates 
that the computed features have good explanatory power for 
prediction of perceived safety. In the next section, we ana- 
lyze the performance of Streetscore in different geographi- 
cal regions of the United States. 

4. Perception Maps 

To use Streetscore for creating global maps of urban per- 
ception, we need to determine the generalization perfor- 
mance of the predictor in terms of geographical distance, 
that is, determine the spatial range for which we expect 
the predictions to hold. Are images from New York and 
Boston good enough to predict the perceived safety of im- 
ages from all over the world? Or is the predictor trained 
using these images applicable only in a limited geograph- 
ical region around New York and Boston? Intuitively, the 
external validity of the predictor would depend on the simi- 
larity between architectural styles and urban planning of the 
cities being evaluated. There are, however, no quantitative
studies on measuring similarities along these axes. Therefore,
we use the average median income of a city as a naive
metric to validate the accuracy of Streetscore for cities not
in the original crowdsourced study.

First, we use Streetscore to create perception maps for
27 cities across the United States using images densely
sampled from Google Streetview at 200 images/square mile
(Fig. 3-(a)), scoring more than 1 million images. Then,
using 9 cities which lie inside a 200 mile radius of New York
City, we compute a linear fit ($L_{qI}$) between the mean
Q-score ($q_s$) of a city and its average median family income
($I_c$) according to the 2010 census (Fig. 3-(b)). We find a
strong correlation between the two variables (Pearson
Correlation Coefficient = 0.81).

This linear regression helps us determine some loose 
bounds on the geographical region over which the algorithm 
can be applied given the current training set. Figure 3-(c)
shows a plot of the residual from $L_{qI}$ of cities as a function of
their distance to New York City. This exercise shows that 
the mean absolute error (MAE) is large for cities in Arizona, 
California and Texas, indicating that the algorithm does not 
perform well in these regions. Based on this regression we 
propose a tentative bound of 1100 miles for the validity of 
our trained predictor. Certainly we do not expect this radius 
of validity to be the same in different parts of the world, 
where architectural gradients might be different. Neverthe- 
less, even if these ranges vary by as much as a factor of 
two, they would indicate that an algorithm trained with a 
few images from a given area can be used to map a signif- 
icant area of its surroundings, which is useful to know for 
future crowdsourced studies. 
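
The linear fit and the residual analysis described above can be sketched with ordinary least squares. The sketch assumes income is the independent variable and mean Q-score the dependent one; the text does not state the direction of the fit. All numbers below are invented for illustration and are not the study's data.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def absolute_residuals(xs, ys, slope, intercept):
    """Absolute distance of each point from the fitted line."""
    return [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]


# Invented data: income (thousands of dollars) and mean Q-score for nearby cities.
near_income = [40.0, 50.0, 60.0, 70.0]
near_q = [4.0, 4.5, 5.0, 5.5]
slope, intercept = linear_fit(near_income, near_q)
r = pearson_r(near_income, near_q)

# One invented distant city, evaluated against the fit from the nearby cities.
far_residuals = absolute_residuals([60.0], [3.8], slope, intercept)
```

Fitting on nearby cities only and then measuring residuals for distant ones keeps the test cities out of the fit, so a large residual can be read as a sign that the predictor generalizes poorly to that region.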

Figure 5 shows perception maps for six cities created using
Streetscore. Our goal is to create such maps for every
city in the Northeast and Midwest of the United States and
make them available through an interactive website. This
dataset should help urban planners, economists and social 
scientists who are increasingly using data-driven techniques 
to study the social and economic consequences of urban en- 
vironments. 




Figure 3. We evaluate the generalization ability of Streetscore in terms of geographical distance by scoring streetscapes from 27 cities
across the United States (a). By using a linear fit (b) between the mean Q-score ($q_s$) of a city and its average median family income ($I_c$),
we calculate the residual for cities as we move farther away from New York (c) and establish a loose bound of 1100 miles from New York
for our predictor.




5. Discussion and Limitations 

Research in visual analysis beyond semantics, such as 
interestingness [ ], beauty [ ] and memorability [ ], has 
been a recent topic of interest in computer vision. Our 
work helps expand this research by focusing on the eval- 
uative dimensions of streetscapes, such as their perceived 
safety, liveliness and character. As a tool, Streetscore can 
help urban planners construct better cities. Yet, Streetscore 
also points to new research directions for the vision com- 
munity. For instance, a fruitful area of research would in- 
volve identifying the objects and features that help explain 
the evaluative dimensions of a streetscape. Also, by classi- 
fying streetscapes evaluatively, it should be possible to ad- 
vance data-driven rendering techniques that are tuned based 
on derived evaluative criteria (such as a place that is lively 
but harmonious). 

Our results also show some important limitations. Since 
it is challenging to encode all the variance of the rich and 
complex visual world in a limited training dataset, the pre- 
dictor fails when evaluating images with unusual visual ele- 
ments, such as atypical architecture and stylistic elements 
like colorful graffiti (Fig. 4). Online learning methods 
which harness the information from user interaction to im- 
prove predictors can be used to overcome this limitation. 

6. Conclusion 

The visual appearance of urban environments can have 
strong effects on the lives of those who inhabit them. Hence 
it is important to understand the impact of architectural 
and urban-planning constructs on people's perception of 
streetscapes. 

However, quantitative surveys of visual perception of
streetscapes are challenging to conduct and, as a result,
existing surveys are limited to a sparse sampling of data from
a few cities. In this paper, we demonstrate that a machine
learning method trained using image features from a small
dataset of images from New York and Boston labeled by a
crowd can be used to create 'perception maps' of 21 cities
in the United States at a resolution of 200 images/square
mile.

Figure 4. Failure cases: Streetscore can produce significant errors
when evaluating images with rare visual elements not encountered
in the training set, such as colorful graffiti, overpasses and modern
architecture. ($q_s$ - True Q-score, $\hat{q}_s$ - Predicted Q-score)

In conclusion, we present a novel computational tool for
measuring perceived safety of cities which should inspire
more research in quantitative analysis of cities and their
impact on their inhabitants. In particular, further research on
the impact of different visual elements of streetscapes on
perception can directly influence the urban planning and
architecture community.
Figure 5. Perception maps for 6 cities at 200 images/square mile. An online interactive dataset of perception will be a useful tool for
scientists studying social and economic impact of urban environments. (Note: the maps are at different scales.)



